Abstract

Within the natural language processing (NLP) community, active learning has
been widely investigated and applied in order to alleviate the annotation
bottleneck faced by developers of new NLP systems and technologies. This paper
presents the first theoretical analysis of stopping active learning based on
stabilizing predictions (SP). The analysis has revealed three elements that are
central to the success of the SP method: (1) bounds on Cohen’s Kappa agreement
between successively trained models impose bounds on differences in F-measure
performance of the models; (2) since the stop set does not have to be labeled,
it can be made large in practice, helping to guarantee that the results
transfer to previously unseen streams of examples at test/application time; and
(3) good (low variance) sample estimates of Kappa between successive models can
be obtained. Proofs of relationships between the level of Kappa agreement and
the difference in performance between consecutive models are presented.
Specifically, if the Kappa agreement between two models exceeds a threshold $T$
(where $T > 0$), then the difference in F-measure performance between those
models is bounded above by $\frac{4(1-T)}{T}$ in all cases. If precision of the
positive conjunction of the models is assumed to be $p$, then the bound can be
tightened to $\frac{4(1-T)}{(p+1)T}$.

1 Introduction 

Active learning (AL), also called query learning and selective sampling, is an
approach to reduce the costs of creating training data that has received
considerable interest (e.g., (Argamon-Engelson and Dagan, 1999; Baldridge and
Osborne, 2008; Bloodgood and Vijay-Shanker, 2009b; Bloodgood and
Callison-Burch, 2010; Hachey et al., 2005; Haertel et al., 2008; Haffari and
Sarkar, 2009; Hwa, 2000; Lewis and Gale, 1994; Sassano, 2002; Settles and
Craven, 2008; Shen et al., 2004; Thompson et al., 1999; Tomanek et al., 2007;
Zhu and Hovy, 2007)).

Within the NLP community, active learning has been widely investigated and
applied in order to alleviate the annotation bottleneck faced by developers of
new NLP systems and technologies. The main idea is that by judiciously
selecting which examples to have labeled, annotation effort will be focused on
the most helpful examples and less annotation effort will be required to
achieve given levels of performance than if a passive learning policy had been
used.

Historically, the problem of developing methods for detecting when to stop AL
was tabled for future work and the research literature was focused on how to
select which examples to have labeled and analyzing the selection methods (Cohn
et al., 199?; Seung et al., 1992; Freund et al., 1997; Roy and McCallum, 2001).
However, to realize the savings in annotation effort that AL enables, we must
have a method for knowing when to stop the annotation process. The challenge is
that if we stop too early while useful generalizations are still being made,
then we can wind up with a model that performs poorly, but if we stop too late
after all the useful generalizations are made, then human annotation effort is
wasted and the benefits of using active learning are lost.

Recently, research has begun to develop methods for stopping AL (Schohn and
Cohn, 2000; Ertekin et al., 2007b; Ertekin et al., 2007a; Zhu and Hovy, 2007;
Laws and Schütze, 2008; Zhu et al., 2008a; Zhu et al., 2008b; Vlachos, 2008;
Bloodgood, 2009; Bloodgood and Vijay-Shanker, 2009a; Ghayoomi, 2010). The
methods are all heuristics based on estimates of model confidence, error, or
stability. Although these heuristic methods have appealing intuitions and have
had experimental success on a small handful of tasks and datasets, the methods
are not widely usable in practice yet because our community’s understanding of
the stopping methods remains too coarse and inexact. Pushing forward on
understanding the mechanics of stopping at a more exact level is therefore
crucial for achieving the design of widely usable effective stopping criteria.


Bloodgood and Vijay-Shanker (2009a) introduce the terminology aggressive and
conservative to describe the behavior of stopping methods^1 and conduct an
empirical evaluation of the different published stopping methods on several
datasets. While most stopping methods tend to behave conservatively, stopping
based on stabilizing predictions computed via inter-model Kappa agreement has
been shown to be consistently aggressive without losing performance (in terms
of F-measure^2) in several published empirical tests. This method stops when
the Kappa agreement between consecutively learned models during AL exceeds a
threshold for three consecutive iterations of AL. Although this is an intuitive
heuristic that has performed well in published experimental results, there has
not been any theoretical analysis of the method.

The current paper presents the first theoretical analysis of stopping based on
stabilizing predictions. The analysis helps to explain at a deeper and more
exact level why the method works as it does. The results of the analysis help
to characterize classes of problems where the method can be expected to work
well and where (unmodified) it will not be expected to work as well. The theory
is suggestive of modifications to improve the robustness of the stopping method
for certain classes of problems. And perhaps most important, the approach that
we use in our analysis provides an enabling framework for more precise analysis
of stopping criteria and possibly other parts of the active learning decision
space.

In addition, the information presented in this paper is useful for works that
consider switching between different active learning strategies and operating
regions such as (Baram et al., 2004; Dönmez et al., 2007; Roth and Small,
2008). Knowing when to switch strategies, for example, is similar to the
stopping problem and is another setting where detailed understanding of the
variance of stabilization estimates and their link to performance ramifications
is useful. More exact understanding of the mechanics of stopping is also useful
for applications of co-training (Blum and Mitchell, 1998), and agreement-based
co-training (Clark et al., 2003) in particular. Finally, the proofs of the
Theorems regarding the relationships between Cohen’s Kappa statistic and
F-measure may be of broader use in works that consider inter-annotator
agreement and its ramifications for performance appraisals, a topic that has
been of long-standing interest in computational linguistics (Carletta, 1996;
Artstein and Poesio, 2008).

^1 Aggressive methods stop sooner, aggressively trying to reduce unnecessary
annotations while conservative methods are careful not to risk losing model
performance, even if it means annotating many more examples than were
necessary.

^2 For the rest of this paper, we will use F-measure to denote F1-measure, that
is, the balanced harmonic mean of precision and recall, which is a standard
metric used to evaluate NLP systems.


In the next section we summarize the stabilizing predictions (SP) stopping
method. Section 3 analyzes SP and Section 4 concludes.

2 Stopping Active Learning based on 
Stabilizing Predictions 

The intuition behind the SP method is that the models learned during AL can be
applied to a large representative set of unlabeled data called a stop set and
when consecutively learned models have high agreement on their predictions for
classifying the examples in the stop set, this indicates that it is time to
stop (Bloodgood and Vijay-Shanker, 2009a; Bloodgood, 2009). The active learning
stopping strategy explicitly examined in (Bloodgood and Vijay-Shanker, 2009a)
(after the general form is discussed) is to calculate Cohen’s Kappa agreement
statistic between consecutive rounds of active learning and stop once it is
above 0.99 for three consecutive calculations.
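The stopping rule just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in the cited work: the function name `sp_stopping_iteration` and its parameters `threshold` and `window` are hypothetical, their defaults are taken from the 0.99 threshold and the three consecutive calculations mentioned above, and the Kappa values are assumed to have been computed already on the stop set.

```python
from typing import Iterable, Optional


def sp_stopping_iteration(kappas: Iterable[float],
                          threshold: float = 0.99,
                          window: int = 3) -> Optional[int]:
    """Return the 0-based position of the Kappa value at which SP would stop.

    kappas holds the Kappa agreement between consecutively learned models,
    one value per AL iteration, measured on the stop set. Returns None if
    the stopping condition is never met.
    """
    run = 0  # length of the current run of above-threshold Kappa values
    for i, kappa in enumerate(kappas):
        run = run + 1 if kappa > threshold else 0
        if run >= window:
            return i
    return None
```

A single dip below the threshold resets the run, so the rule only fires once agreement has stayed high for `window` calculations in a row.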

Since the Kappa statistic is an important aspect of this method, we now discuss
some background regarding measuring agreement in general, and Cohen’s Kappa in
particular. Measurement of agreement between human annotators has received
significant attention and in that context, the drawbacks of using percentage
agreement have been recognized (Artstein and Poesio, 2008). Alternative metrics
have been proposed that take chance agreement into account. Artstein and Poesio
(2008) survey several agreement metrics. Most of the agreement metrics they
discuss are of the form:

    \text{agreement} = \frac{A_o - A_e}{1 - A_e},    (1)

where $A_o$ = observed agreement, and $A_e$ = agreement expected by chance. The
different metrics differ in how they compute $A_e$. All the instances of usage
of an agreement metric in this article will have two categories and two coders.
The two categories are “+1” and “-1” and the two coders are the two consecutive
models for which agreement is being measured.

Cohen’s Kappa statistic^3 (Cohen, 1960) measures agreement expected by chance
by modeling each coder (in our case model) with a separate distribution
governing their likelihood of assigning a particular category. Formally, Kappa
is defined by Equation 1 with $A_e$ computed as follows:

    A_e = \sum_{k \in \{+1,-1\}} P(k|c_1) \cdot P(k|c_2),    (2)

where each $c_i$ is one of the coders (in our case, models), and $P(k|c_i)$ is
the probability that coder (model) $c_i$ labels an instance as being in
category $k$. Kappa estimates the $P(k|c_i)$ in Equation 2 based on the
proportion of observed instances that coder (model) $c_i$ labeled as being in
category $k$.
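As an illustration of Equations 1 and 2, the following sketch computes Cohen’s Kappa for two models from their predicted labels. It assumes the labels are encoded as the integers +1 and -1 and are held in lists or tuples; the function name `cohens_kappa` is a hypothetical choice made for this example.

```python
from typing import Sequence


def cohens_kappa(labels_1: Sequence[int], labels_2: Sequence[int]) -> float:
    """Cohen's Kappa for two coders (models) over the categories +1 and -1."""
    if len(labels_1) != len(labels_2) or len(labels_1) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_1)
    # Observed agreement A_o: fraction of instances labeled identically.
    a_o = sum(x == y for x, y in zip(labels_1, labels_2)) / n
    # Chance agreement A_e (Equation 2): each coder has its own distribution,
    # estimated from the proportion of instances it put in each category.
    a_e = 0.0
    for k in (+1, -1):
        p_1 = sum(x == k for x in labels_1) / n
        p_2 = sum(y == k for y in labels_2) / n
        a_e += p_1 * p_2
    if a_e == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    # Equation 1.
    return (a_o - a_e) / (1.0 - a_e)
```

For example, two models that agree on 85 of 100 instances, where the first labels 50 of them positive and the second labels 45 positive, have $A_o = 0.85$, $A_e = 0.5 \cdot 0.45 + 0.5 \cdot 0.55 = 0.5$, and hence Kappa $= 0.7$.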


3 Analysis

This section analyzes the SP stopping method. Section 3.1 analyzes the variance
of the estimator of Kappa that SP uses and in particular the relationship of
this variance to specific aspects of the operationalization of SP, such as the
stop set size. Section 3.2 analyzes relationships between the Kappa agreement
between two models and the difference in F-measure between those two models.

3.1 Variance of Kappa Estimator

SP bases its decision to stop on the information contained in the contingency
tables between the classifications of models learned at consecutive iterations
during AL. In determining whether to stop at iteration $t$, the classifications
of the current model $M_t$ are compared with the classifications of the
previous model $M_{t-1}$. Table 1 shows the population parameters for these two
models, where: population probability $\pi_{ij}$ for $i,j \in \{+,-\}$ is the
probability of an example being placed in category $i$ by model $M_{t-1}$ and
category $j$ by model $M_t$; population probability $\pi_{.j}$ for
$j \in \{+,-\}$ is the probability of an example being placed in category $j$
by model $M_t$; and population probability $\pi_{i.}$ for $i \in \{+,-\}$ is
the probability of an example being placed in category $i$ by model $M_{t-1}$.
The actual probability of agreement is $\pi_o = \pi_{++} + \pi_{--}$. As
indicated in Equation 2, Kappa models the probability of agreement expected due
to chance by assuming that classifications are made independently. Hence, the
probability of agreement expected by chance in terms of the population
probabilities is $\pi_e = \pi_{+.}\pi_{.+} + \pi_{-.}\pi_{.-}$. From the
definition of Kappa (see Equation 1), we then have that the Kappa parameter $K$
in terms of the population probabilities is given by

    K = \frac{\pi_o - \pi_e}{1 - \pi_e}.    (3)

                      $M_t$
$M_{t-1}$     +             -             Total
+             $\pi_{++}$    $\pi_{+-}$    $\pi_{+.}$
-             $\pi_{-+}$    $\pi_{--}$    $\pi_{-.}$
Total         $\pi_{.+}$    $\pi_{.-}$    1

Table 1: Contingency table population probabilities for $M_t$ (model learned at
iteration t) and $M_{t-1}$ (model learned at iteration t-1).

^3 We note that there are other agreement measures (beyond Cohen’s Kappa) which
could also be applicable to stopping based on stabilizing predictions, but an
analysis of these is outside the scope of the current paper.


For practical applications we will not know the true population probabilities
and we will have to resort to using sample estimates. The SP method uses a stop
set of size $n$ for deriving its estimates. Table 2 shows the contingency table
counts for the classifications of models $M_t$ and $M_{t-1}$ on a sample of
size $n$. The population probabilities $\pi_{ij}$ can be estimated by the
relative frequencies $p_{ij}$ for $i,j \in \{+,-,.\}$, where: $p_{++} = a/n$;
$p_{+-} = b/n$; $p_{-+} = c/n$; $p_{--} = d/n$; $p_{+.} = (a+b)/n$;
$p_{-.} = (c+d)/n$; $p_{.+} = (a+c)/n$; and $p_{.-} = (b+d)/n$. Let
$p_o = p_{++} + p_{--}$, the observed proportion of agreement, and let
$p_e = p_{+.}p_{.+} + p_{-.}p_{.-}$, the proportion of agreement expected by
chance if we assume that $M_t$ and $M_{t-1}$ make their classifications
independently. Then the Kappa measure of agreement $K$ between $M_t$ and
$M_{t-1}$ (see Equation 3) is estimated by

    \hat{K} = \frac{p_o - p_e}{1 - p_e}.    (4)

                      $M_t$
$M_{t-1}$     +          -          Total
+             a          b          a + b
-             c          d          c + d
Total         a + c      b + d      n

Table 2: Contingency table counts for $M_t$ (model learned at iteration t) and
$M_{t-1}$ (model learned at iteration t-1).


Using the delta method, as described in (Bishop et al., 1975), Fleiss et al.
(1969) derived an estimator of the large-sample variance of $\hat{K}$.
According to Hale and Fleiss (1993), the estimator simplifies to

    \widehat{Var}(\hat{K}) = \frac{1}{n(1-p_e)^2} \Big\{ \sum_{i \in \{+,-\}} p_{ii}\,[1 - 4\bar{p}_i(1-\hat{K})]
        - [\hat{K} - p_e(1-\hat{K})]^2
        + (1-\hat{K})^2 \sum_{i,j \in \{+,-\}} p_{ij}\,[2(\bar{p}_i + \bar{p}_j) - (p_{i.} + p_{.j})]^2 \Big\},    (5)

where $\bar{p}_i = (p_{i.} + p_{.i})/2$. From Equation 5, we can see that the
variance of our estimate of Kappa is inversely proportional to the size of the
stop set we use.
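To make Equations 4 and 5 concrete, the sketch below computes the Kappa estimate and its large-sample variance from the four counts of Table 2. It follows the formulas as we read them from the text; the function name `kappa_and_variance` is hypothetical, and the code assumes that chance agreement is strictly below 1 so that the divisions are defined.

```python
from typing import Tuple


def kappa_and_variance(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Return (Kappa estimate, variance estimate) from the counts of Table 2."""
    n = a + b + c + d
    # p[i][j]: row i is the label given by M_{t-1}, column j the label given
    # by M_t, with index 0 standing for "+" and index 1 for "-".
    p = [[a / n, b / n], [c / n, d / n]]
    row = [p[0][0] + p[0][1], p[1][0] + p[1][1]]  # marginals p_{i.}
    col = [p[0][0] + p[1][0], p[0][1] + p[1][1]]  # marginals p_{.j}
    p_o = p[0][0] + p[1][1]
    p_e = row[0] * col[0] + row[1] * col[1]
    kappa = (p_o - p_e) / (1.0 - p_e)  # Equation 4

    p_bar = [(row[i] + col[i]) / 2.0 for i in range(2)]
    # The three terms inside the braces of Equation 5.
    diag = sum(p[i][i] * (1.0 - 4.0 * p_bar[i] * (1.0 - kappa))
               for i in range(2))
    squared = (kappa - p_e * (1.0 - kappa)) ** 2
    cross = sum(
        p[i][j] * (2.0 * (p_bar[i] + p_bar[j]) - (row[i] + col[j])) ** 2
        for i in range(2) for j in range(2)
    )
    variance = (diag - squared + (1.0 - kappa) ** 2 * cross) / (
        n * (1.0 - p_e) ** 2)
    return kappa, variance
```

Multiplying all four counts by the same factor leaves the proportions and the Kappa estimate unchanged and divides the variance by that factor, which is the inverse proportionality to the stop set size noted above.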

Bloodgood and Vijay-Shanker (2009a) used a stop set of size 2000 for each of
their datasets. Although this worked well in the results they reported, we do
not believe that 2000 is a fixed size that will work well for all tasks and
datasets where the SP method could be used. Table 3 shows the variances of
$\hat{K}$ computed using Equation 5 at the points at which SP stopped AL for
each of the datasets^4 from (Bloodgood and Vijay-Shanker, 2009a).


These variances indicate that the size of 2000 was typically sufficient to get
tight estimates of Kappa, helping to illuminate the empirical success of the SP
method on these datasets. More generally, the SP method can be augmented with a
variance check: if the variance of estimated Kappa at a potential stopping
point exceeds some desired threshold, then the stop set size can be increased
as needed to reduce the variance.

^4 We note that each of the datasets was set up as a binary classification task
(or multiple binary classification tasks). Further details and descriptions of
each of the datasets can be found in (Bloodgood and Vijay-Shanker, 2009a).

Looking at Equation 5 again, one can note that when $p_e$ is relatively close
to 1, the variance of $\hat{K}$ can be expected to get quite large. In these
situations, users of SP should expect to have to use larger stop set sizes and
in extreme conditions, SP may not be an advisable method to use.
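One simple way to act on the variance check is sketched below. It relies on an assumption that goes beyond the text: that the cell proportions of the contingency table stay roughly the same as the stop set grows, so that the estimated variance scales as $1/n$. The function name `stop_set_size_for_variance` and its parameters are hypothetical.

```python
import math


def stop_set_size_for_variance(current_n: int,
                               current_var: float,
                               target_var: float) -> int:
    """Estimate the stop set size needed to reach a target variance of Kappa.

    Assumes the contingency table proportions do not change as examples are
    added, so the variance estimate is inversely proportional to the size.
    """
    if current_n <= 0 or target_var <= 0 or current_var < 0:
        raise ValueError("current_n and target_var must be positive and "
                         "current_var must be non-negative")
    needed = math.ceil(current_n * current_var / target_var)
    # Never suggest shrinking a stop set that already meets the target.
    return max(current_n, needed)
```

Under this assumption, a variance that is five times the desired threshold calls for a stop set about five times as large. When $p_e$ is close to 1 the starting variance is large, and the suggested size grows accordingly.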


3.2 Relationship between Kappa agreement 
and change in performance between 
models 

Heretofore, the published literature contained only informal explanations of
why stabilizing predictions is expected to work well as a stopping method
(along with empirical tests demonstrating successful operation on a handful of
tasks and datasets). In the remainder of this section we describe the
mathematical foundations for stopping methods based on stabilizing predictions.
In particular, we will prove that even in the worst possible case, if the Kappa
agreement between two subsequently learned models is greater than a threshold
$T$, then it must be the case that the change in performance between these two
models is bounded above by $\frac{4(1-T)}{T}$. We then go on to prove
additional Theorems that tighten this bound when assumptions are made about
model precision.

Lemma 3.1 Suppose F-measure $F$ and Kappa $K$ are computed from the same
contingency table of counts, such as the one given in Table 2. Suppose
$ad - bc > 0$. Then $F \geq K$.

Proof By definition, in terms of the contingency table counts,

    K = \frac{2ad - 2bc}{(a+b)(b+d) + (a+c)(c+d)}    (6)

and

    F = \frac{2a}{2a + b + c}.    (7)

Rewriting $F$ so that it will have the same numerator as $K$, we have:

    F = \frac{2a}{2a + b + c} \cdot \frac{d - \frac{bc}{a}}{d - \frac{bc}{a}}    (8)
      = \frac{2ad - 2bc}{2ad + bd + cd - 2bc - \frac{b^2c + bc^2}{a}}.    (9)

We can see that the expression for $F$ in Equation 9 has the same numerator as
$K$ in Equation 6, but the denominator of $K$ in Equation 6 is $\geq$ the
denominator of $F$ in Equation 9. Therefore, $F \geq K$. □

Task-Dataset                        Variance of $\hat{K}$
NER-DNA (10-fold CV)                0.0000223
NER-cellType (10-fold CV)           0.0000211
NER-protein (10-fold CV)            0.0000074
Reuters (10 Categories)             0.0000298
20 Newsgroups (20 Categories)       0.0000739
WebKB Student (10-fold CV)          0.0000137
WebKB Project (10-fold CV)          0.0000190
WebKB Faculty (10-fold CV)          0.0000115
WebKB Course (10-fold CV)           0.0000179
TC-spamassassin (10-fold CV)        0.0000042
TC-TREC-SPAM (10-fold CV)           0.0000043
Average (macro-avg)                 0.0000209

Table 3: Estimates of the variance of $\hat{K}$. For each dataset, the estimate
of the variance of $\hat{K}$ is computed (using Equation 5) from the
contingency table at the point at which SP stopped AL and the average of all
the variances (across all folds of CV) is displayed. The last row contains the
macro-average of the average variances for all the datasets.

Theorem 3.2 Let $M_t$ be the model learned at iteration $t$ of active learning
and $M_{t-1}$ be the model learned at iteration $t-1$. Let $K_t$ be the
estimate of Kappa agreement between the classifications of $M_t$ and $M_{t-1}$
on the examples in the stop set. Let $F_t$ be the F-measure between the
classifications of $M_t$ and truth on the stop set. Let $F_{t-1}$ be the
F-measure between the classifications of $M_{t-1}$ and truth on the stop set.
Let $\Delta F_t$ be $F_t - F_{t-1}$. Suppose $T > 0$. Then
$K_t > T \Rightarrow |\Delta F_t| \leq \frac{4(1-T)}{T}$.

Proof Suppose $M_t$, $M_{t-1}$, $K_t$, $F_t$, $F_{t-1}$, $\Delta F_t$, and $T$
are defined as stated in the statement of Theorem 3.2. Let $\tilde{F}_t$ be the
F-measure between the classifications of $M_t$ and $M_{t-1}$ on the examples in
the stop set. Let Table 2 show the contingency table counts for $M_t$ versus
$M_{t-1}$ on the examples in the stop set. Then, from their definitions, we
have $K_t = \frac{2(ad-bc)}{(a+b)(b+d)+(a+c)(c+d)}$ and
$\tilde{F}_t = \frac{2a}{2a+b+c}$. There exist true labels for the examples in
the stop set, which we don’t know since the stop set is unlabeled, but
nonetheless must exist. We use the truth on the stop set to split Table 2 into
two subtables of counts, one table for all the examples that are truly positive
and one table for all the examples that are truly negative. Table 4 shows the
contingency table for $M_t$ versus $M_{t-1}$ for all of the examples in the
stop set that have true labels of +1 and Table 5 shows the contingency table
for $M_t$ versus $M_{t-1}$ for all of the examples in the stop set that have
true labels of -1.

                      $M_t$
$M_{t-1}$     +              -              Total
+             $a_1$          $b_1$          $a_1 + b_1$
-             $c_1$          $d_1$          $c_1 + d_1$
Total         $a_1 + c_1$    $b_1 + d_1$    $n_1$

Table 4: Contingency table counts for $M_t$ (model learned at iteration t)
versus $M_{t-1}$ (model learned at iteration t-1) for only the examples in the
stop set that have truth = +1.

                      $M_t$
$M_{t-1}$     +                    -                    Total
+             $a_{-1}$             $b_{-1}$             $a_{-1} + b_{-1}$
-             $c_{-1}$             $d_{-1}$             $c_{-1} + d_{-1}$
Total         $a_{-1} + c_{-1}$    $b_{-1} + d_{-1}$    $n_{-1}$

Table 5: Contingency table counts for $M_t$ (model learned at iteration t)
versus $M_{t-1}$ (model learned at iteration t-1) for only the examples in the
stop set that have truth = -1.

From Tables 2, 4, and 5 one can see that $a$ is the number of examples in the
stop set that both $M_t$ and $M_{t-1}$ classified as positive. Furthermore, out
of these $a$ examples, $a_1$ of them truly are positive and $a_{-1}$ of them
truly are negative. Similar explanations hold for the other counts. Also, from
Tables 2, 4, and 5 one can see that the equalities $a = a_1 + a_{-1}$,
$b = b_1 + b_{-1}$, $c = c_1 + c_{-1}$, and $d = d_1 + d_{-1}$ all hold. The
contingency tables for $M_t$ versus truth and $M_{t-1}$ versus truth can be
derived from Tables 4 and 5. For convenience, Table 6 shows the contingency
table for $M_t$ versus truth and Table 7 shows the contingency table for
$M_{t-1}$ versus truth.

                  $M_t$
Truth     +                    -                    Total
+         $a_1 + c_1$          $b_1 + d_1$          $n_1$
-         $a_{-1} + c_{-1}$    $b_{-1} + d_{-1}$    $n_{-1}$
Total     $a + c$              $b + d$              $n$

Table 6: Contingency table counts for $M_t$ (model learned at iteration t)
versus truth. (Derived from Tables 4 and 5.)

                  $M_{t-1}$
Truth     +                    -                    Total
+         $a_1 + b_1$          $c_1 + d_1$          $n_1$
-         $a_{-1} + b_{-1}$    $c_{-1} + d_{-1}$    $n_{-1}$
Total     $a + b$              $c + d$              $n$

Table 7: Contingency table counts for $M_{t-1}$ (model learned at iteration
t-1) versus truth. (Derived from Tables 4 and 5.)

Suppose that $K_t > T$. This implies, by Lemma 3.1,^5 that $\tilde{F}_t > T$.
This implies that

    \frac{2a}{2a+b+c} > T    (11)
    \Rightarrow 2a > (2a+b+c)T    (12)
    \Rightarrow 2a(1-T) > (b+c)T    (13)
    \Rightarrow b + c < \frac{2a(1-T)}{T}.    (14)

Note that Equations 12 and 14 are justified since $2a + b + c > 0$ and $T > 0$,
respectively.

From Table 6 we can see that
$F_t = \frac{2(a_1+c_1)}{2(a_1+c_1) + b_1 + d_1 + a_{-1} + c_{-1}}$; from
Table 7 we can see that
$F_{t-1} = \frac{2(a_1+b_1)}{2(a_1+b_1) + c_1 + d_1 + a_{-1} + b_{-1}}$. For
notational convenience, let: $g = 2(a_1+c_1) + b_1 + d_1 + a_{-1} + c_{-1}$;
and $h = 2(a_1+b_1) + c_1 + d_1 + a_{-1} + b_{-1}$. It follows that

    \Delta F_t = \frac{2(a_1+c_1)}{g} - \frac{2(a_1+b_1)}{h}    (15)
               = \frac{(2a_1+2c_1)h - (2a_1+2b_1)g}{gh}.    (16)

^5 Note that the condition $ad - bc > 0$ of Lemma 3.1 is met since $K_t > T$
and $T > 0$ imply $K_t > 0$, which in turn implies $ad - bc > 0$.


For notational convenience, let:
$x = 2(a_1c_1 + a_1b_{-1} + c_1^2 + c_1d_1 + c_1a_{-1} + c_1b_{-1})$; and
$y = 2(a_1b_1 + a_1c_{-1} + b_1^2 + b_1d_1 + b_1a_{-1} + b_1c_{-1})$.
Then picking up from Equation 16, it follows that

    \Delta F_t = \frac{x - y}{gh}    (17)
               = \frac{2[u_1 + c_1u_2 - b_1u_3]}{gh},    (18)

where $u_1 = a_1c_1 - a_1b_1 + a_1b_{-1} - a_1c_{-1}$,
$u_2 = c_1 + d_1 + a_{-1} + b_{-1}$, and $u_3 = b_1 + d_1 + a_{-1} + c_{-1}$.

For notational convenience, let: $d_A = c_1 - b_1$ and
$d_B = c_{-1} - b_{-1}$. Then it follows that

    \Delta F_t = \frac{2u_4}{gh},    (19)

where: $u_4 = a_1(d_A - d_B) + d_A(d_1 + a_{-1} + b_1 + c_1) + c_1b_{-1} - b_1c_{-1}$.

Noting that $g = h + d_A + d_B$, we have

    \Delta F_t = \frac{2u_4}{h(h + d_A + d_B)}.    (20)

Noting that
$2u_4 = 2[d_A(a_1 + b_1 + c_1 + d_1 + a_{-1} + b_{-1}) - d_B(a_1 + b_1)]$ and
letting $u_5 = a_1 + b_1 + c_1 + d_1 + a_{-1} + b_{-1}$, we have

    \Delta F_t = \frac{2[d_Au_5 - d_B(a_1 + b_1)]}{h(h + d_A + d_B)}.    (21)

Therefore,

    |\Delta F_t| \leq 2\left|\frac{d_Au_5}{h(h + d_A + d_B)}\right|
                 + 2\left|\frac{d_B(a_1 + b_1)}{h(h + d_A + d_B)}\right|.    (22)


Recall that $b + c = b_1 + b_{-1} + c_1 + c_{-1}$. Then observe that the
following three inequalities hold: $b + c \geq |d_A|$; $b + c \geq |d_B|$; and
$h(h + d_A + d_B) > 0$. Therefore,

    |\Delta F_t| \leq \frac{2(b+c)[2a_1 + 2b_1 + c_1 + d_1 + a_{-1} + b_{-1}]}{h(h + d_A + d_B)}    (23)
                 = \frac{2(b+c)h}{h(h + d_A + d_B)}    (24)
                 = \frac{2(b+c)}{h + d_A + d_B}    (25)
                 \leq \frac{2(2a)(1-T)}{T(h + d_A + d_B)}    (26)
                 = \left(\frac{4(1-T)}{T}\right)\left(\frac{a}{h + d_A + d_B}\right).    (27)

Observe that $h + d_A + d_B = 2a_1 + b_1 + 2c_1 + d_1 + a_{-1} + c_{-1}$.
Therefore, $\frac{a}{h + d_A + d_B} \leq 1$. Therefore, we have

    |\Delta F_t| \leq \frac{4(1-T)}{T}.    (28) □


Note that in deriving Inequality 26 we used the previously derived Inequality
14. Also, the proof of Theorem 3.2 assumes a worst possible case in the sense
that all examples where the classifications of $M_t$ and $M_{t-1}$ differ are
assumed to have truth values that all serve to maximize one model’s F-measure
and minimize the other model’s F-measure so as to maximize $|\Delta F_t|$ as
much as possible. A resulting limitation is that the bound is loose in many
cases. It may be possible to derive tighter bounds, perhaps by easing off to an
expected case instead of a worst case and/or by making additional
assumptions.^6
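The looseness of the worst-case bound can be seen on a concrete example. The sketch below computes $K_t$ and $\Delta F_t$ from the eight counts of Tables 4 and 5 and compares $|\Delta F_t|$ with the bound of Theorem 3.2. The function names and the example counts used afterwards are hypothetical and chosen only for illustration; they do not come from any of the datasets discussed in this paper.

```python
from typing import Tuple

Counts = Tuple[int, int, int, int]  # (a, b, c, d) cells of a 2x2 table


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    if denom == 0:
        raise ValueError("F-measure is undefined for an all-negative table")
    return 2 * tp / denom


def kappa_from_counts(a: int, b: int, c: int, d: int) -> float:
    """Kappa in terms of the contingency table counts (Equation 6)."""
    return 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))


def delta_f_and_kappa(pos: Counts, neg: Counts) -> Tuple[float, float]:
    """Return (F_t - F_{t-1}, K_t) from the counts of Tables 4 and 5.

    pos holds (a_1, b_1, c_1, d_1) for the truly positive examples and
    neg holds (a_-1, b_-1, c_-1, d_-1) for the truly negative examples.
    """
    a1, b1, c1, d1 = pos
    a_n, b_n, c_n, d_n = neg
    f_t = f_measure(a1 + c1, a_n + c_n, b1 + d1)     # M_t versus truth
    f_prev = f_measure(a1 + b1, a_n + b_n, c1 + d1)  # M_{t-1} versus truth
    kappa = kappa_from_counts(a1 + a_n, b1 + b_n, c1 + c_n, d1 + d_n)
    return f_t - f_prev, kappa


def theorem_3_2_bound(threshold: float) -> float:
    """Worst-case bound on |Delta F_t| when K_t exceeds threshold."""
    return 4 * (1 - threshold) / threshold
```

For instance, with the hypothetical counts `pos=(40, 2, 0, 8)` and `neg=(1, 0, 1, 148)`, the two models agree with Kappa of about 0.955, the F-measure changes by about 0.034 in absolute value, and the bound for $T = 0.95$ is about 0.21, so the guarantee holds with considerable room to spare.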

Taking this possibility up, we now prove tighter bounds when assumptions about
the precision of the models $M_t$ and $M_{t-1}$ are made. Consider that in the
proof of Theorem 3.2 when transitioning from Equality 27 to Inequality 28 we
used the fact that $\frac{a}{h + d_A + d_B} \leq 1$. Note that
$\frac{a}{h + d_A + d_B} = \frac{a_1 + a_{-1}}{2a_1 + b_1 + 2c_1 + d_1 + a_{-1} + c_{-1}}$,
from which one sees that $\frac{a}{h + d_A + d_B} = 1$ only if $a_1$, $b_1$,
$c_1$, $d_1$ and $c_{-1}$ are all zero. This is a pathological case. In many
practically important classes of cases to consider, $\frac{a}{h + d_A + d_B}$
is strictly less than 1, and often substantially less than 1. The following two
Theorems prove tighter bounds on $|\Delta F_t|$ than Theorem 3.2 by utilizing
this insight.

Theorem 3.3 Suppose $M_t$, $M_{t-1}$, $K_t$, $F_t$, $F_{t-1}$, $\Delta F_t$,
and $T$ are defined as stated in the statement of Theorem 3.2. Let the
contingency tables be defined as they were in the proof of Theorem 3.2. Let
$M_{PositiveConjunction}$ be a model that only classifies an example as
positive if both models $M_t$ and $M_{t-1}$ classify the example as positive.
Suppose that $M_{PositiveConjunction}$ has perfect precision on the stop set,
or in other words that every single example from the stop set that both $M_t$
and $M_{t-1}$ classify as positive is truthfully positive (i.e.,
$a_{-1} = 0$). Then $K_t > T \Rightarrow |\Delta F_t| \leq \frac{2(1-T)}{T}$.


Proof The proof of Theorem 3.2 holds exactly as it is up until Equality 27.
Now, using the additional assumption that $a_{-1} = 0$, we have
$\frac{a}{h + d_A + d_B} = \frac{a_1}{2a_1 + b_1 + 2c_1 + d_1 + c_{-1}} \leq \frac{1}{2}$.
Therefore, we have

    |\Delta F_t| \leq \frac{2(1-T)}{T}.    (29) □

Theorem 3.3 is a special case (in the limit) of a more general Theorem. Before
stating and proving the more general Theorem, we prove a Lemma that will be
helpful in making the proof of the general Theorem clearer.

Lemma 3.4 Let $h$, $d_A$, $d_B$ and contingency table counts be defined as they
were in the proof of Theorem 3.2. Suppose $a_1 = xa_{-1}$. Then
$\frac{a}{h + d_A + d_B} \leq \frac{x+1}{2x+1}$.

Proof $a_1 = xa_{-1}$ by hypothesis. $a = a_1 + a_{-1}$ by definition of
contingency table counts. Hence, $a = (x+1)a_{-1}$. Therefore,

    \frac{a}{h + d_A + d_B} \leq \frac{(x+1)a_{-1}}{2xa_{-1} + a_{-1}}    (30)
                            = \frac{(x+1)a_{-1}}{(2x+1)a_{-1}}
                            = \frac{x+1}{2x+1}. □

The following Theorem generalizes Theorem 3.3 to cases when
$M_{PositiveConjunction}$ has precision $p$ in $(0, 1)$.^7

Theorem 3.5 Suppose $M_t$, $M_{t-1}$, $K_t$, $F_t$, $F_{t-1}$, $\Delta F_t$,
and $T$ are defined as stated in the statement of Theorem 3.2. Let the
contingency tables be defined as they were in the proof of Theorem 3.2. Let
$M_{PositiveConjunction}$ be a model that only classifies an example as
positive if both models $M_t$ and $M_{t-1}$ classify the example as positive.
Suppose that $M_{PositiveConjunction}$ has precision $p$ on the stop set. Then
$K_t > T \Rightarrow |\Delta F_t| \leq \frac{4(1-T)}{(p+1)T}$.

Proof The proof of Theorem 3.2 holds exactly as it is up until Equality 27.
$M_{PositiveConjunction}$ has precision $p$ on the stop set
$\Rightarrow p = \frac{a_1}{a_1 + a_{-1}}$. Solving for $a_1$ in terms of
$a_{-1}$ we have $a_1 = \frac{p}{1-p}a_{-1}$. Therefore, applying Lemma 3.4
with $x = \frac{p}{1-p}$, we have

    \frac{a}{h + d_A + d_B} \leq \frac{\frac{p}{1-p} + 1}{\frac{2p}{1-p} + 1}    (31)
                            = \frac{1}{p+1}.

Therefore we have

    |\Delta F_t| \leq \frac{4(1-T)}{(p+1)T}.    (32) □

^6 If one is planning to undertake this challenge, we would suggest further
consideration of Inequalities [...] and 28 as a possible starting point.

^7 The case when $p = 0$ is handled by Theorem 3.2 and the case when $p = 1$ is
handled by Theorem 3.3.

Precision     Scaling factor (to 3 decimal places)
50%           0.667
80%           0.556
90%           0.526
95%           0.513
98%           0.505
99%           0.503
99.9%         0.500

Table 8: Values of the scaling factor from Theorem 3.5 for different precision
values.


The scaling factor in Theorem 3.5 shows how the precision of the conjunctive
model affects the bound. Theorem 3.2 had the scaling factor implicitly set to 1
in order to handle the pathological case where the positive conjunctive model
has precision = 0. In Theorem 3.3, where the positive conjunctive model has
precision = 1 on the examples in the stop set, the scaling factor is set to
1/2. Theorem 3.5 generalizes the scaling factor so that it is a function of the
precision of the positive conjunctive model. For convenience, Table 8 shows the
scaling factor values for a few different precision values.
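The scaling factor and the resulting bounds are easy to compute directly. The sketch below treats the scaling factor as $\frac{1}{p+1}$, which is what the proof of Theorem 3.5 yields and what the values in Table 8 correspond to; the function names `scaling_factor` and `delta_f_bound` are hypothetical.

```python
def scaling_factor(precision: float) -> float:
    """Scaling factor 1 / (p + 1) for a positive conjunctive model of precision p."""
    if not 0.0 <= precision <= 1.0:
        raise ValueError("precision must lie in [0, 1]")
    return 1.0 / (precision + 1.0)


def delta_f_bound(threshold: float, precision: float = 0.0) -> float:
    """Bound on |Delta F_t| when Kappa exceeds threshold.

    precision=0.0 gives the worst-case bound of Theorem 3.2 and
    precision=1.0 gives the bound of Theorem 3.3.
    """
    if threshold <= 0.0:
        raise ValueError("threshold must be positive")
    return 4.0 * (1.0 - threshold) / threshold * scaling_factor(precision)
```

With the Kappa threshold of 0.99 mentioned in Section 2, the worst-case bound is about 0.040, and it drops to about 0.020 if the positive conjunctive model has perfect precision on the stop set.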

The bounds in Theorems 3.2, 3.3, and 3.5 all bound the difference in
performance on the stop set of two consecutively learned models $M_t$ and
$M_{t-1}$. An issue to consider is how connected the difference in performance
on the stop set is to the difference in performance on a stream of application
examples generated according to the population probabilities. Taking up this
issue, consider that the proofs of Theorems 3.2, 3.3, and 3.5 would hold as
they are if we had used sample proportions instead of sample counts (this can
be seen by simply dividing every count by $n$, the size of the stop set). Since
the stop set is unbiased (selected at random from the population), as $n$
approaches infinity, the sample proportions will approach the population
probabilities and the difference between the difference in performance between
$M_t$ and $M_{t-1}$ on the stop set and on a stream of application examples
generated according to the population probabilities will approach zero.


4 Conclusions 


and datasets. But the methods are not widely 
usable in practice yet because our community’s 
understanding of the stopping methods remains 
too inexact. Pushing forward on understanding 
the mechanics of stopping at a more exact level 
is therefore crucial for achieving the design of 
widely usable effective stopping criteria. 

This paper presented the first theoretical analysis of stopping based on
stabilizing predictions. The analysis revealed three elements that are central
to the SP method’s success: (1) the sample estimates of Kappa have low
variance; (2) Kappa has tight connections with differences in F-measure; and
(3) since the stop set doesn’t have to be labeled, it can be arbitrarily large,
helping to guarantee that the results transfer to previously unseen streams of
examples at test/application time.

We presented proofs of relationships between the level of Kappa agreement and
the difference in performance between consecutive models. Specifically, if the
Kappa agreement between two models is at least $T$, then the difference in
F-measure performance between those models is bounded above by
$\frac{4(1-T)}{T}$. If precision of the positive conjunction of the models is
assumed to be $p$, then the bound can be tightened to
$\frac{4(1-T)}{(p+1)T}$.

The setup and methodology of the proofs can serve as a launching pad for many
further investigations, including: analyses of stopping; works that consider
switching between different active learning strategies and operating regions;
and works that consider stopping co-training, and especially agreement-based
co-training. Finally, the relationships that have been exposed between the
Kappa statistic and F-measure may be of broader use in works that consider
inter-annotator agreement and its interplay with system evaluation, a topic
that has been of long-standing interest.